release: RAG retrieval overhaul (RRF + IVF_PQ), redaction hardening, coverage#47

Merged: jb-thery merged 8 commits into main from develop on Jul 3, 2026
jb-thery commented on Jul 3, 2026

Release PR — develop → main

Promotes the RAG retrieval, security, and coverage overhaul to production. Merging this PR triggers the protected Release npm workflow, which runs semantic-release to derive the version from Conventional Commits and publishes @jcode.labs/ragmir-tts, then @jcode.labs/ragmir, to npm.

Expected version bump

MINOR (e.g. 2.0.0 → 2.1.0), driven by the feat commits (RRF fusion, IVF_PQ index, config hardening). No breaking public-API changes.
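
To make the expected bump concrete, here is a minimal sketch of how Conventional Commit headers map to a release type. It approximates the commonly used default rules (feat gives minor, fix or perf gives patch, a breaking change gives major). The real semantic-release analyzer is configurable, and `classify`, `releaseTypeFor`, and `nextVersion` are hypothetical names, not code from this repository.

```typescript
// Simplified approximation of Conventional Commits release rules.
// All function names here are hypothetical illustrations.
type ReleaseType = "major" | "minor" | "patch";

const PRIORITY: ReleaseType[] = ["patch", "minor", "major"];

function classify(message: string): ReleaseType | null {
  const header = message.split("\n")[0];
  // type, optional (scope), optional "!" breaking marker, then ":"
  const match = /^(\w+)(?:\([^)]*\))?(!)?:/.exec(header);
  if (!match) return null;
  if (match[2] === "!" || /^BREAKING CHANGE:/m.test(message)) return "major";
  if (match[1] === "feat") return "minor";
  if (match[1] === "fix" || match[1] === "perf") return "patch";
  return null; // test, chore, docs, ... trigger no release on their own
}

function releaseTypeFor(messages: string[]): ReleaseType | null {
  let best: ReleaseType | null = null;
  for (const message of messages) {
    const current = classify(message);
    if (current && (best === null || PRIORITY.indexOf(current) > PRIORITY.indexOf(best))) {
      best = current; // keep the highest-impact release type seen so far
    }
  }
  return best;
}

function nextVersion(version: string, type: ReleaseType | null): string {
  const [major, minor, patch] = version.split(".").map(Number);
  if (type === "major") return `${major + 1}.0.0`;
  if (type === "minor") return `${major}.${minor + 1}.0`;
  if (type === "patch") return `${major}.${minor}.${patch + 1}`;
  return version; // nothing releasable
}

// Headers abbreviated from the release notes of this PR.
const bump = releaseTypeFor([
  "feat(query): weighted Reciprocal Rank Fusion for hybrid retrieval",
  "feat(store): automatic IVF_PQ vector index above 256 rows",
  "fix(redaction): Luhn verification on credit cards",
  "test: suite 132 -> 151 cases / 23 files",
  "chore: dist/ is now gitignored build output",
]);
console.log(nextVersion("2.0.0", bump)); // 2.1.0
```

The feat commits outrank the fix commit, so the highest release type across the set is minor.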

What's in this release

  • feat(query): weighted Reciprocal Rank Fusion for hybrid retrieval (rank-only, no score calibration). Recall stays at 1.0 on the sovereign-rag-demo golden set.
  • feat(store): automatic IVF_PQ vector index above 256 rows (scalability beyond brute force).
  • fix(redaction): Luhn verification on credit cards, URL username redaction, Stripe/GitLab/Bearer providers.
  • feat(core): strict config schema, env-override warnings, access-log retention (10 MB cap), bounded LRU Transformers cache, CLI parsers extracted to testable module.
  • test: suite 132 → 151 cases / 23 files.
  • chore: dist/ is now gitignored build output.

Pre-merge verification

After merge

The Release npm workflow publishes both packages. No local publish, no direct push to main.

jb-thery added 8 commits July 3, 2026 17:47
chore: back-merge main into develop after 2.0.0 release
Move all packages/*/dist/ directories from committed artifacts to gitignored
build output. dist/ is regenerated locally with `pnpm build` before running the
CLI, MCP smoke, the library-API demo, or `pnpm validate`.

- .gitignore: ignore ragmir-core/dist, ragmir-tts/dist (already ignored for
  app/landing/license-webhook); add *dist catch-all.
- ci.yml: drop the `git diff --exit-code -- dist` step that enforced committed
  dist, since dist is no longer tracked.
- AGENTS.md, CLAUDE.md, README.md, library-api-demo README: document that dist
  is gitignored and must be built locally; warn against `npx ragmir` for local
  testing (resolves the published npm package, not the working copy).
Replace the weighted-sum fusion (vector and BM25 scores divided by their max)
with Reciprocal Rank Fusion, the standard hybrid-retrieval approach. Each
candidate scores `weight / (RRF_K + rank)` per retriever it appears in, summed
across retrievers, so the BM25 and vector score distributions never need
calibration against each other.

The vector retriever is weighted higher (0.7) than the lexical one (0.3)
because, with the default local-hash embeddings, vector proximity is the more
discriminant signal on small corpora; the lexical weight still lets exact-
keyword evidence pull in candidates the vector retriever missed.

- RRF_K = 60 (Cormack et al. 2009 constant).
- Remove the now-unused weighted-sum helpers (vectorScore, normalizeScore) and
  the normalizeForMatch import left dead by the refactor.

Retrieval recall stays at 1.0 on the sovereign-rag-demo golden set.
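
The fusion rule described above can be sketched as follows. This is an illustration rather than the code from this commit: `fuseWithRrf` and `RankedList` are hypothetical names, ranks are assumed to be 1-based, and only `RRF_K = 60` and the 0.7 / 0.3 weights come from the commit message.

```typescript
// Weighted Reciprocal Rank Fusion: each list contributes weight / (RRF_K + rank).
const RRF_K = 60;

interface RankedList {
  weight: number;
  ids: string[]; // candidate ids, best match first
}

function fuseWithRrf(lists: RankedList[]): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const { weight, ids } of lists) {
    ids.forEach((id, index) => {
      const rank = index + 1; // assumption: ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + weight / (RRF_K + rank));
    });
  }
  // Highest fused score first. Raw retriever scores are never compared.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

const fused = fuseWithRrf([
  { weight: 0.7, ids: ["a", "b", "c"] }, // vector retriever
  { weight: 0.3, ids: ["c", "a", "d"] }, // lexical (BM25) retriever
]);
console.log(fused.map((entry) => entry.id)); // [ 'a', 'c', 'b', 'd' ]
```

Candidate `d` appears only in the lexical list and still reaches the fused result, which is the "pull in candidates the vector retriever missed" behavior. Candidate `c` overtakes `b` because it is ranked by both retrievers.
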
Above a 256-row threshold, automatically create an IVF_PQ index on the vector
column after writing the table. Below the threshold, LanceDB keeps using an
exact flat scan, which is optimal for small corpora and avoids wasted index-
training work.

- numPartitions ≈ sqrt(rowCount), clamped to [8, 1024] (LanceDB production
  heuristic).
- numSubVectors = 16 (divides the 384-dim local-hash/mxbai-xsmall vectors).
- index creation is idempotent (skipped if vector_idx exists) and best-effort
  (a training failure on edge-case dimensionality leaves the table usable via
  flat scan rather than failing the ingest).

This unblocks query scalability beyond brute-force scan without changing the
overwrite write path.
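
The parameter selection above can be sketched as a pure function. This is a minimal sketch under stated assumptions: `planVectorIndex` and `IvfPqParams` are hypothetical names, "above 256 rows" is read as strictly greater than 256, the square root is rounded to the nearest integer, and the divisibility guard stands in for the best-effort fallback to flat scan. It deliberately makes no LanceDB calls.

```typescript
const INDEX_ROW_THRESHOLD = 256;
const NUM_SUB_VECTORS = 16;

interface IvfPqParams {
  numPartitions: number;
  numSubVectors: number;
}

// Returns null when the table should stay on the exact flat scan.
function planVectorIndex(rowCount: number, dimension: number): IvfPqParams | null {
  if (rowCount <= INDEX_ROW_THRESHOLD) return null; // small corpus: flat scan is enough
  if (dimension % NUM_SUB_VECTORS !== 0) return null; // assumption: skip rather than fail
  const numPartitions = Math.min(1024, Math.max(8, Math.round(Math.sqrt(rowCount))));
  return { numPartitions, numSubVectors: NUM_SUB_VECTORS };
}

console.log(planVectorIndex(200, 384)); // null
console.log(planVectorIndex(10_000, 384)); // { numPartitions: 100, numSubVectors: 16 }
console.log(planVectorIndex(4_000_000, 384)); // { numPartitions: 1024, numSubVectors: 16 }
```

With 384-dimensional vectors and 16 sub-vectors, each sub-vector covers 24 dimensions.
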
Close two confidentiality gaps and broaden provider coverage in the built-in
redaction patterns:

- credit_card: add a match-then-verify Luhn check (new RedactionPattern.verify
  field). Numeric runs that are not valid card numbers (version numbers,
  account IDs, hex runs) are left untouched instead of being over-redacted.
- url_credentials: extend the pattern so both the username and the password are
  redacted. Previously only the password was stripped, leaking the username.
- Add Stripe secret keys (sk_live/rk_live/sk_test), GitLab tokens (glpat-), and
  generic Bearer tokens. Order the more specific patterns before the generic
  api_token so they win on overlap.
- Add an optional `verify: "luhn"` to the RedactionPattern type so custom
  patterns can opt into the same check.
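
The match-then-verify idea for credit_card can be sketched as follows. This is an illustration, not the shipped pattern: `passesLuhn`, `redactCards`, the candidate regex, the 13 to 19 digit length limit, and the `[REDACTED:credit_card]` placeholder are all assumptions made for the example.

```typescript
// Luhn checksum: double every second digit from the right, sum, check mod 10.
function passesLuhn(candidate: string): boolean {
  const digits = candidate.replace(/[\s-]/g, "");
  if (!/^\d{13,19}$/.test(digits)) return false; // assumed plausible card lengths
  let sum = 0;
  let double = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // "0" is char code 48
    if (double) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    double = !double;
  }
  return sum % 10 === 0;
}

// Hypothetical candidate pattern: 13 to 19 digits with optional space or hyphen separators.
const CARD_CANDIDATE = /\b\d(?:[ -]?\d){12,18}\b/g;

function redactCards(text: string): string {
  // Match first, then redact only the runs that pass verification.
  return text.replace(CARD_CANDIDATE, (match) =>
    passesLuhn(match) ? "[REDACTED:credit_card]" : match,
  );
}

console.log(redactCards("card 4242 4242 4242 4242, build 1234567890123"));
// card [REDACTED:credit_card], build 1234567890123
```

The 13-digit build number matches the candidate pattern but fails the checksum, so it is left untouched instead of being over-redacted.
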
…d use

Several additive robustness and observability improvements, plus extraction of
the CLI option parsers into a testable module:

- config: make rawConfigSchema strict so unknown keys (typos) are rejected
  instead of silently ignored; warn on stderr when an env override (e.g.
  RAGMIR_TOP_K=abc) is invalid so operators notice a no-op override.
- access-log: bound the log growth with a soft cap. When the file exceeds
  10 MB, trim it to the most recent 50 000 lines before the next append, so a
  long-lived MCP server cannot grow it without limit or OOM a usage report.
- embeddings: bound the Transformers.js pipeline cache to 3 entries with LRU
  eviction, and export clearTransformersCache(). destroyIndex now calls it so a
  re-ingest with a different embedding config does not pin stale ONNX weights.
- cli-options: extract the pure option parsers (parsePositiveInt, parseNumber,
  parseRecallThreshold, audioEngine, audioAllowRemoteModels, audioLanguage,
  parseAgentInstallScope, parseAgentInstallMode) into a dedicated module so
  they can be unit-tested without importing commander. cli.ts imports them.
  parsePositiveInt now rejects fractional input like "1.5" instead of silently
  truncating via parseInt.
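
Two of the items above, strict integer parsing and the bounded LRU cache, can be sketched as follows. Neither is the project's code: `parsePositiveIntStrict` and `BoundedLruCache` are hypothetical names, and only the rejection of input such as "1.5" and the 3-entry bound come from the commit message.

```typescript
// parseInt("1.5", 10) returns 1, silently dropping the fraction.
// Validating the whole string first makes such input an error instead.
function parsePositiveIntStrict(raw: string): number {
  const trimmed = raw.trim();
  const value = Number(trimmed);
  if (!/^\d+$/.test(trimmed) || !Number.isSafeInteger(value) || value <= 0) {
    throw new Error(`Expected a positive integer, got "${raw}"`);
  }
  return value;
}

// A Map iterates in insertion order, so the first key is the least recently used.
class BoundedLruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    this.entries.delete(key);
    this.entries.set(key, value); // re-insert to mark as most recently used
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest); // evict the least recently used entry
    }
  }

  clear(): void {
    this.entries.clear();
  }

  get size(): number {
    return this.entries.size;
  }
}

const cache = new BoundedLruCache<string, string>(3);
cache.set("a", "A");
cache.set("b", "B");
cache.set("c", "C");
cache.get("a"); // "a" becomes the most recently used entry
cache.set("d", "D"); // evicts "b", the least recently used
console.log(cache.get("b"), cache.size); // undefined 3
```

A `clear()` method plays the role that the commit assigns to clearTransformersCache(): dropping every cached entry so stale model weights are not kept alive.
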
Close the test-coverage gaps the audit identified, raising the suite from 132
to 151 cases across 23 files:

- destroy.test.ts (new): destroyIndex removed flag and access-log entry.
- query.test.ts: ask() empty-sources and populated cited-retrieval branches.
- store.test.ts: empty-text-files manifest round-trip, removal on empty,
  missing, malformed, and malformed-entry filtering; writeRows zero-rows
  dropTable and full re-write.
- embeddings.test.ts: embedTexts([]) early return and clearTransformersCache.
- ingest.test.ts: --rebuild forces a full re-index (reusedFiles === 0).
- config.test.ts: strict() rejects unknown keys; non-object config rejected.
- access-log.test.ts: retention trims past 10 MB; disabled logging writes
  nothing.
- evaluate.test.ts: miss case (hit=false, bestRank=null, recall=0).
- redaction.test.ts: Luhn pass/fail, URL username redacted, Stripe/GitLab/
  bearer providers, obfuscation limitation documented.
- cli.test.ts (new): all cli-options parsers incl. the MP3-without-engine
  confidentiality guard and agent scope/mode validation.
- text.test.ts (new): tokenize/normalizeForMatch (the BM25 foundation).
…y-overhaul

feat(core): RAG retrieval overhaul (RRF + IVF_PQ), redaction hardening, and test coverage
@jb-thery jb-thery merged commit 17e20a1 into main Jul 3, 2026
10 checks passed
github-actions Bot commented Jul 3, 2026

🎉 This PR is included in version 2.1.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
